Algorithm 7 RBONN training.
Input: a minibatch of inputs and their labels, real-valued weights $w$, recurrent model weights $U$, scaling factor matrix $A$, learning rates $\eta_1$, $\eta_2$, and $\eta_3$.
Output: updated real-valued weights $w^{t+1}$, updated scaling factor matrix $A^{t+1}$, and updated recurrent model weights $U^{t+1}$.
1: while Forward propagation do
2:   $b_w^t \leftarrow \mathrm{sign}(w^t)$.
3:   $b_{a_{in}}^t \leftarrow \mathrm{sign}(a_{in}^t)$.
4:   Feature calculation using Eq. 6.36.
5:   Loss calculation using Eq. 6.68.
6: end while
7: while Backward propagation do
8:   Compute $\frac{\partial L}{\partial A^t}$, $\frac{\partial L}{\partial w^t}$, and $\frac{\partial L}{\partial U^t}$ using Eqs. 6.70, 6.72, and 3.136.
9:   Update $A^{t+1}$, $w^{t+1}$, and $U^{t+1}$ according to Eqs. 6.69, 6.44, and 6.50, respectively.
10: end while
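The forward and backward passes above can be condensed into a single training step. Below is a minimal PyTorch-style sketch, not the reference implementation: the name `rbonn_step` is ours, `loss_fn` stands in for the feature and loss computations of Eqs. 6.36 and 6.68, the DReLU mask is a hypothetical placeholder for the term in Eq. 3.136, and $U$ is assumed to share $w$'s shape.

```python
import torch

def rbonn_step(w, A, U, loss_fn, eta1, eta2, eta3):
    """One training iteration roughly following Algorithm 7.

    loss_fn(b_w, A) stands in for the feature/loss computation
    (Eqs. 6.36 and 6.68) and must be differentiable in w and A.
    """
    # Forward propagation: binarize the real-valued weights.
    # sign(w) gives the forward value; (w - w.detach()) is zero in the
    # forward pass but routes the gradient straight through to w.
    b_w = torch.sign(w) + (w - w.detach())
    loss = loss_fn(b_w, A)

    # Backward propagation: dL/dA and dL/dw (Eqs. 6.70 and 6.72).
    grad_A, grad_w = torch.autograd.grad(loss, [A, w])

    with torch.no_grad():
        # dL/dU per Eq. 3.136; the (w > 0) mask is only a placeholder
        # for DReLU(w^{t-1}, A^t), whose exact form is defined elsewhere.
        grad_U = grad_w * (w > 0).to(w.dtype)
        # Parameter updates (Eqs. 6.69, 6.44, and 3.135).
        A -= eta1 * grad_A
        w -= eta2 * grad_w
        U.copy_((U - eta3 * grad_U).abs())   # Eq. 3.135 keeps U non-negative
    return loss.item()

# Toy usage with a made-up quadratic loss:
w = torch.randn(4, 8, requires_grad=True)
A = torch.rand(4, requires_grad=True)
U = torch.rand(4, 8)
loss = rbonn_step(w, A, U,
                  lambda b_w, A: ((A.unsqueeze(1) * b_w).sum() - 1.0) ** 2,
                  1e-2, 1e-2, 1e-2)
```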
where $w' = \mathrm{diag}(\|w_1\|_1, \cdots, \|w_{C_{out}}\|_1)$. We judge that asynchronous convergence occurs in the optimization when $(\neg D(w'_i)) \wedge D(A_i) = 1$, where the density function is defined as
$$D(x_i) = \begin{cases} 1 & \text{if } \mathrm{ranking}(\sigma(x)_i) > T, \\ 0 & \text{otherwise}, \end{cases} \qquad (3.134)$$
where $T = \mathrm{int}(C_{out} \times \tau)$ and $\tau$ is the hyperparameter that denotes the threshold. $\sigma(x)_i$ denotes the $i$-th eigenvalue of the diagonal matrix $x$, and $x_i$ denotes the $i$-th row of the matrix $x$.
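As an illustration, a small PyTorch sketch of the density function and the asynchronous-convergence test follows. The helper names `density` and `async_converged` are ours, and we assume $\mathrm{ranking}(\cdot)$ means the ascending 0-based rank of $\sigma(x)_i$ among the diagonal entries, which is our reading of Eq. 3.134 rather than a detail stated here.

```python
import torch

def density(diag_vals, tau):
    """D(x_i) of Eq. 3.134 for a diagonal matrix given by its diagonal.

    diag_vals holds the eigenvalues sigma(x)_i; ranking() is assumed to
    be the ascending 0-based rank of each entry among the diagonal.
    Returns a 0/1 tensor with D(x_i) = 1 iff ranking(sigma(x)_i) > T.
    """
    c_out = diag_vals.numel()
    T = int(c_out * tau)                    # T = int(C_out * tau)
    ranks = diag_vals.argsort().argsort()   # ascending ranks, 0-based
    return (ranks > T).to(torch.int64)

def async_converged(w, A_diag, tau):
    """Flag (not D(w'_i)) and D(A_i) == 1 per output channel,
    where w' = diag(||w_1||_1, ..., ||w_Cout||_1)."""
    w_prime = w.flatten(1).abs().sum(dim=1)   # ||w_i||_1 per channel
    return (1 - density(w_prime, tau)) * density(A_diag, tau)
```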
Finally, we define the optimization of $U$ as
$$U^{t+1} = \left| U^t - \eta_3 \frac{\partial L}{\partial U^t} \right|, \qquad (3.135)$$
$$\frac{\partial L}{\partial U^t} = \frac{\partial L_S}{\partial w^t} \circ \mathrm{DReLU}(w^{t-1}, A^t), \qquad (3.136)$$
where $\eta_3$ is the learning rate of $U$. We elaborate on the RBONN training process in Algorithm 7.
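For concreteness, a minimal sketch of this update is shown below. The name `update_U` is ours; it assumes that $U$, $\partial L_S / \partial w^t$, and the DReLU mask all share one shape, and it takes the mask as a precomputed input since its exact form is defined elsewhere.

```python
import torch

def update_U(U, grad_ws, drelu_mask, eta3):
    """U update of Eqs. 3.135-3.136 (a sketch under shape assumptions).

    grad_ws:    dL_S/dw^t, gradient of the loss w.r.t. real weights
    drelu_mask: DReLU(w^{t-1}, A^t), taken here as a precomputed tensor
    """
    grad_U = grad_ws * drelu_mask      # Eq. 3.136 (Hadamard product)
    return (U - eta3 * grad_U).abs()   # Eq. 3.135: |U^t - eta3 * dL/dU^t|
```

The absolute value keeps $U$ non-negative after each step, matching Eq. 3.135.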
3.8.3 Discussion
In this section, we first review related methods for "gradient approximation" in BNNs, then discuss how RBONN differs from these methods and analyze the effectiveness of the proposed RBONN.
In particular, BNN [99] directly utilizes the Straight-Through Estimator (STE) in the training stage to calculate the gradients of weights and activations as
$$\frac{\partial b_{w_{i,j}}}{\partial w_{i,j}} = 1_{|w_{i,j}|<1}, \qquad \frac{\partial b_{a_{i,j}}}{\partial a_{i,j}} = 1_{|a_{i,j}|<1}, \qquad (3.137)$$
which suffers from an obvious gradient mismatch between the binarization function and its surrogate gradient.
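A minimal PyTorch realization of this estimator, written as a custom autograd function, might look as follows; the class name `SignSTE` is ours.

```python
import torch

class SignSTE(torch.autograd.Function):
    """sign() with the straight-through estimator of Eq. 3.137."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Identity gradient inside (-1, 1), zero outside (Eq. 3.137).
        return grad_output * (x.abs() < 1).to(grad_output.dtype)

# Usage: b_w = SignSTE.apply(w)
```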
Intuitively, Bi-Real Net [159] designs an approximate binarization function that helps alleviate the gradient mismatch in backward propagation:
$$\frac{\partial b_{a_{i,j}}}{\partial a_{i,j}} = \begin{cases} 2 + 2a_{i,j}, & -1 \le a_{i,j} < 0, \\ 2 - 2a_{i,j}, & 0 \le a_{i,j} < 1, \\ 0, & \text{otherwise}, \end{cases} \qquad (3.138)$$
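A corresponding sketch of this piecewise gradient as a custom autograd function is given below, with $\mathrm{sign}(\cdot)$ kept in the forward pass; the class name `SignApprox` is ours.

```python
import torch

class SignApprox(torch.autograd.Function):
    """sign() with Bi-Real Net's piecewise polynomial gradient (Eq. 3.138)."""

    @staticmethod
    def forward(ctx, a):
        ctx.save_for_backward(a)
        return torch.sign(a)

    @staticmethod
    def backward(ctx, grad_output):
        (a,) = ctx.saved_tensors
        grad = torch.zeros_like(a)           # zero gradient outside [-1, 1)
        neg = (a >= -1) & (a < 0)
        pos = (a >= 0) & (a < 1)
        grad[neg] = 2 + 2 * a[neg]           # 2 + 2a on [-1, 0)
        grad[pos] = 2 - 2 * a[pos]           # 2 - 2a on [0, 1)
        return grad_output * grad

# Usage: b_a = SignApprox.apply(a)
```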